Identifying bilingual Multi-Word Expressions for Statistical Machine Translation

Authors

  • Dhouha Bouamor
  • Nasredine Semmar
  • Pierre Zweigenbaum

Abstract

Multi-Word Expressions (MWEs) represent a key issue for numerous applications in Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we describe a strategy for detecting translation pairs of MWEs in a French-English parallel corpus. In addition, we introduce three methods for integrating the extracted bilingual MWEs into MOSES, a phrase-based Statistical Machine Translation (SMT) system. We show experimentally that these textual units can improve translation quality.


Similar resources

Grouping Multi-Word Expressions According To Part-Of-Speech In Statistical Machine Translation

This paper studies a strategy for identifying and using multi-word expressions in Statistical Machine Translation. The performance of the proposed strategy for various types of multi-word expressions (like nouns or verbs) is evaluated in terms of alignment quality as well as translation accuracy. Evaluations are performed by using real-life data, namely the European Parliament corpus. Results f...

Multi-word Expressions in English-Latvian Machine Translation

The paper presents a series of experiments that aim to find the best method for treating multi-word expressions (MWE) in machine translation. Methods have been investigated in the framework of statistical machine translation (SMT) for translation from English into Latvian. MWE candidates have been extracted using pattern-based and statistical approaches. Different techniques for MWE integration in...

Bilingual Multi-Word Term Tokenization for Chinese–Japanese Patent Translation

We propose to re-tokenize data with aligned bilingual multi-word terms to improve statistical machine translation (SMT) in technical domains. For that, we independently extract multi-word terms from the monolingual parts of the training data. Promising bilingual multi-word terms are then identified using the sampling-based alignment method by setting some threshold on translation probabilities....

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, with the hyphen separating the different parts. The Persian language has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of the space leads to some s...

Extraction of Bilingual Technical Terms for Chinese-Japanese Patent Translation

The translation of patents or scientific papers is a key issue that should be helped by the use of statistical machine translation (SMT). In this paper, we propose a method to improve Chinese–Japanese patent SMT by premarking the training corpus with aligned bilingual multi-word terms. We automatically extract multi-word terms from monolingual corpora by combining statistical and linguistic fil...

Publication date: 2012